Abstract


We address the problem of communicating domain knowledge from a user to the designer of a clustering algorithm. We propose a protocol in which the user provides a clustering of a relatively small random sample of a data set. The algorithm designer then uses that sample to come up with a data representation under which k-means clustering results in a clustering (of the full data set) that is aligned with the user’s clustering. We provide a formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm. We then introduce a notion of capacity of a class of possible representations, in the spirit of the VC-dimension, showing that classes of representations that have finite such dimension can be successfully learned with sample size error bounds, and end our discussion with an analysis of that dimension for classes of representations induced by linear embeddings.


1 INTRODUCTION 

Clustering can be thought of as the task of automatically dividing a set of objects into “coherent” subsets. This definition is not concrete, but its vagueness allows it to serve as an umbrella term for a wide diversity of algorithmic paradigms. Clustering algorithms are routinely applied in a huge variety of fields.

Given a dataset that needs to be clustered for some application, one can choose among a variety of different clustering algorithms, along with different pre-processing techniques, that are likely to result in dramatically different answers. It is therefore critical to incorporate prior knowledge about the data and the intended semantics of the clustering into the process of picking a clustering algorithm (or, clustering model selection). Regretfully, there seems to be no systematic tool for incorporating domain expertise into clustering model selection, and such decisions are usually made in embarrassingly ad hoc ways. This paper aims to address that critical deficiency in a formal statistical framework.


We approach the challenge by considering a scenario in which the domain expert (i.e., the intended user of the clustering) conveys her domain knowledge by providing a clustering of a small random subset of her data set. For example, consider a big customer service center that wishes to cluster incoming requests into groups to streamline their handling. Since the database of requests is too large to be organized manually, the service center wishes to employ a clustering program. As the clustering designer, we would then ask the service center to pick a random sample of requests, manually cluster them, and show us the resulting grouping of that sample. The clustering tool then uses that sample clustering to pick a clustering method that, when applied to the full data set, will result in a clustering that follows the patterns demonstrated by that sample clustering. We address this paradigm from a statistical machine learning perspective. To achieve generalization guarantees for such an approach, it is essential to introduce some inductive bias. We do that by restricting the clustering algorithm to a predetermined hypothesis class (or a set of concrete clustering algorithms).


In a recent Dagstuhl workshop, Blum (2014) proposed to do that by fixing a clustering algorithm, say k-means, and searching for a metric over the data under which k-means optimization yields a clustering that agrees with the training sample clustering. One should note that, given any domain set $X$, for any k-partitioning $P$ of $X$, there exists some distance function $d_P$ over $X$ such that $P$ is the optimal k-means clustering solution to the input $(X, d_P)$.$^{1}$ Consequently, to protect against potential overfitting, the class of potential distance functions should be constrained. In this paper, we provide (apparently the first) concrete formal framework for such a paradigm, as well as a generalization analysis of this approach.

1 This property is sometimes called k-richness.
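The richness remark above can be checked on a toy instance. The sketch below is only an illustration, not part of the paper's method: it uses the collapsing construction from the proof of Proposition 1 (every object is mapped to a point that depends only on its target cluster) and a brute-force k-means search that is feasible only for tiny inputs. The data, the function names, and the choice of a 2-dimensional image space are all assumptions made for the example.

```python
from itertools import product


def kmeans_cost_of_partition(points, labels, k):
    """Average squared distance of every point to the mean of its cluster."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in members)
    return total / len(points)


def brute_force_kmeans(points, k):
    """Exhaustive search over all labelings that use exactly k clusters."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue
        cost = kmeans_cost_of_partition(points, labels, k)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best


def as_partition(labels):
    """Turn a label vector into a set of clusters (label names are ignored)."""
    groups = {}
    for idx, l in enumerate(labels):
        groups.setdefault(l, set()).add(idx)
    return {frozenset(g) for g in groups.values()}


# Hypothetical target partition of six objects into k = 3 clusters.
target_labels = [0, 1, 2, 0, 1, 2]
# Collapsing map: the image of an object depends only on its target cluster.
collapsed = [(float(l), 0.0) for l in target_labels]

cost, found = brute_force_kmeans(collapsed, 3)
recovered = as_partition(found) == as_partition(target_labels)
```

Because any target partition can be recovered this way, an unrestricted class of mappings can fit every sample clustering, which is why the class has to be constrained.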











In this work we focus on center-based clustering, an important class of clustering algorithms. In these algorithms, the goal is to find a set of “centers” (or prototypes), and the clusters are the Voronoi cells induced by this set of centers. The objective of such a clustering is to minimize the expected value of some monotonically increasing function of the distances of points to their cluster centers. The k-means clustering objective is arguably the most popular clustering paradigm in this class. Currently, center-based clustering tools lack a vehicle for incorporating domain expertise. Domain knowledge is usually taken into account only through an ad hoc choice of input data representation. Regretfully, it might not be realistic to require the domain expert to translate sufficiently elaborate task-relevant knowledge into hand-crafted features.

As a model for learning representations, we assume that the user-desirable clustering can be approximated by first mapping the sample to some Euclidean (or Hilbert) space and then performing k-means clustering in the mapped space (or equivalently, replacing the input data metric by some kernel and performing center-based clustering with respect to that kernel). However, the clustering algorithm is supposed to learn a suitable mapping based on the given sample clustering.

The main question addressed in this work is that of the sample complexity: what is the size of a sample, to be clustered by the domain expert, that suffices for finding a close-to-optimal mapping (i.e., a mapping that generalizes well on the test data)? Intuitively, this sample complexity depends on the richness of the class of potential mappings that the algorithm is choosing from. In standard supervised learning, there are well established notions of capacity of hypothesis classes (e.g., VC-dimension) that characterize the sample complexity of learning. This paper aims to provide such relevant notions of capacity for clustering.


1.1 Previous Work 


In practice, there are methods that use some forms of supervision for clustering. These methods are sometimes called “semi-supervised clustering” (Basu et al. (2002, 2004); Kulis et al. (2009)). The most common method to convey such supervision is through a set of pairwise must/cannot-link constraints on the instances (Wagstaff et al. (2001)). A common way of using such information is by changing the objective of clustering so that violations of these constraints are penalized (Demiriz et al. (1999); Law et al.; Basu et al. (2008)). Another approach, which is closer to ours, keeps the clustering optimization objective fixed, and instead searches for a metric that best fits given constraints. The metric is learned based on some objective function over metrics (Xing et al. (2002); Alipanahi et al. (2008)), so that pairs of instances marked must-link will be close in the new metric space (and cannot-link pairs will be considered as far apart). The two approaches above can also be integrated (Bilenko et al. (2004)). However, these objective functions are usually rather ad hoc. In particular, it is not clear in what sense they are compatible with the adopted clustering algorithm (such as k-means clustering).

A different approach to the problem of communicating user expertise for the purpose of choosing a clustering tool is discussed in Ackerman et al. (2010). They considered a set of properties, or requirements, for clustering algorithms, and investigated which of those properties hold for various algorithms. The user can then pick the right algorithm based on the requirements that she wants the algorithm to meet. However, to turn such an approach into a practically useful tool, one will need to come up with properties that are relevant to the end user of clustering, a goal that is still far from being reached.

Statistical convergence rates of sample clustering to the optimal clustering, with respect to some data generating probability distribution, play a central role in our analysis. From that perspective, most relevant to our paper are results that provide generalization bounds for k-means clustering. Ben-David (2007) proposed the first dimension-independent generalization bound for k-means clustering based on compression techniques. Biau et al. (2008) tightened this result by an analysis of Rademacher complexity. Maurer and Pontil (2010) investigated a more general framework, in which generalization bounds for k-means as well as other algorithms can be obtained. It should be noted that these results are about the standard clustering setup (without any supervised feedback), where the data representation is fixed and known to the clustering algorithm.


1.2 Contributions 

Our first contribution is to provide a statistical framework to analyze the problem of learning representation for clustering. We assume that the expert has some implicit target clustering of the dataset in her mind. The learner, however, is unaware of it, and instead has to select a mapping among a set of potential mappings, under which the result of k-means clustering will be similar to the target partition. An appropriate notion of loss function is introduced to quantify the success of the learner. Then, we define the analogous notion of PAC-learnability$^{2}$ for the problem of learning representation for clustering.

The second contribution of the paper is the introduction of a combinatorial parameter, a specific notion of the capacity of the class of mappings, that determines the sample complexity of the clustering learning tasks. This combinatorial notion is a multivariate version of pseudo-dimension of a class of real-valued mappings. We show that there is uniform convergence of empirical losses to the true loss, over any class of embeddings, $\mathcal{F}$, at a rate that is determined by the proposed dimension of that $\mathcal{F}$. This implies that any empirical risk minimization algorithm (ERM) will successfully learn such a class from sample sizes upper bounded by those rates. Finally, we analyze a particular natural class, the class of linear mappings from $\mathbb{R}^{d_1}$ to $\mathbb{R}^{d_2}$, and show that, roughly speaking, a sample size of $O(d_1 d_2/\epsilon^2)$ is sufficient to guarantee an $\epsilon$-optimal representation.

2 PAC stands for the well known notion of “probably approximately correct”, popularized by Valiant (1984).

The rest of this paper is organized as follows: Section 2 defines the problem setting. Then in Section 3 we investigate ERM-type algorithms and show that “uniform convergence” is sufficient for them to work. Furthermore, this section presents the uniform convergence results and the proof of an upper bound for the sample complexity. Finally, we conclude in Section 4 and provide some directions for future work.

2 PROBLEM SETTING 

2.1 Preliminaries 


Let $X$ be a finite domain set. A k-clustering of $X$ is a partition of $X$ into $k$ subsets. If $C$ is a k-clustering, we denote the subsets of the partition by $C_1, \ldots, C_k$; therefore we have $C = \{C_1, \ldots, C_k\}$. Let $\pi^k$ denote the set of all permutations over $[k]$, where $[k]$ denotes $\{1, 2, \ldots, k\}$. The clustering difference between two clusterings, $C^1$ and $C^2$, with respect to $X$ is defined by

$$\Delta_X(C^1, C^2) = \min_{\sigma \in \pi^k} \frac{1}{|X|} \sum_{i=1}^{k} |C^1_i \,\Delta\, C^2_{\sigma(i)}| \tag{1}$$

where $|\cdot|$ and $\Delta$ denote the cardinality and the symmetric difference of sets respectively. For a sample $S \subset X$, and $C^1$ (a partition of $X$), we define $C^1|_S$ to be the partition of $S$ induced by $C^1$, namely $C^1|_S = \{C^1_1 \cap S, \ldots, C^1_k \cap S\}$. Accordingly, the sample-based difference between two partitions is defined by

$$\Delta_S(C^1, C^2) = \Delta_S(C^1|_S, C^2|_S) \tag{2}$$

Let $f$ be a mapping from $X$ to $\mathbb{R}^d$, and $\mu = (\mu_1, \ldots, \mu_k)$ be a vector of $k$ centers in $\mathbb{R}^d$. The clustering defined by $(f, \mu)$ is the partition over $X$ induced by the $\mu$-Voronoi partition in $\mathbb{R}^d$. Namely,

$C_f(\mu) = (C_1, \ldots, C_k)$, where for all $i$,

$$C_i = \{x \in X : \|f(x) - \mu_i\|_2 \le \|f(x) - \mu_j\|_2 \text{ for all } j \ne i\}$$

In this paper, we investigate the transductive setup, where there is a given data set, known to the learner, that needs to be clustered. Clustering often occurs as a task over some data generating distribution (e.g., Von Luxburg and Ben-David (2005)). The current work can be readily extended to that setting. However, in that case, we assume that the clustering algorithm gets, on top of the clustered sample, a large unclustered sample drawn from that data generating distribution.

The k-means cost of clustering $X$ with a set of centers $\mu = \{\mu_1, \ldots, \mu_k\}$ and with respect to a mapping $f$ is defined by

$$\mathrm{COST}_X(f, \mu) = \frac{1}{|X|} \sum_{x \in X} \min_{\mu_j \in \mu} \|f(x) - \mu_j\|_2^2 \tag{3}$$

The k-means clustering algorithm finds the set of centers $\mu^f_X$ that minimize this cost.$^{3}$ In other words,

$$\mu^f_X = \arg\min_{\mu} \mathrm{COST}_X(f, \mu) \tag{4}$$

Also, for a partition $C$ and mapping $f$, we can define the cost of clustering as follows.

$$\mathrm{COST}_X(f, C) = \frac{1}{|X|} \sum_{i \in [k]} \min_{\mu_j} \sum_{x \in C_i} \|f(x) - \mu_j\|_2^2 \tag{5}$$

For a mapping $f$ as above, let $C^f_X$ denote the k-means clustering of $X$ induced by $f$, namely

$$C^f_X = C_f(\mu^f_X) \tag{6}$$

The difference between two mappings $f_1$ and $f_2$ with respect to $X$ is defined by the difference between the results of k-means clustering using these mappings. Formally,

$$\Delta_X(f_1, f_2) = \Delta_X(C^{f_1}_X, C^{f_2}_X) \tag{7}$$

The following proposition shows the “k-richness” property of the k-means objective.

Proposition 1. Let $X$ be a domain set. For every k-clustering of $X$, $C$, and every $d \in \mathbb{N}^+$, there exists a mapping $g : X \to \mathbb{R}^d$ such that $C^g_X = C$.

Proof. The mapping $g$ can be picked such that it collapses each cluster $C_i$ into a single point in $\mathbb{R}^d$ (and so the image of $X$ under mapping $g$ will be just $k$ single points in $\mathbb{R}^d$). The result of k-means clustering under such a mapping will be $C$. □

3 We assume that the solution to k-means clustering is unique. We will elaborate on this issue in the next sections.
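To make the definitions in this subsection concrete, here is a minimal sketch of the clustering difference of Equations (1) and (2) and of the k-means cost of Equation (3). It assumes Equation (1) is read as the minimum over permutations of the summed symmetric differences divided by the size of the set, and it represents clusterings as lists of Python sets. The function names and the toy data are hypothetical.

```python
from itertools import permutations


def clustering_difference(c1, c2, universe):
    """Equations (1)-(2): restrict both clusterings to `universe`, then take
    the minimum over permutations of the summed symmetric differences."""
    k = len(c1)
    best = min(
        sum(len((c1[i] ^ c2[sigma[i]]) & universe) for i in range(k))
        for sigma in permutations(range(k))
    )
    return best / len(universe)


def kmeans_cost(points, f, centers):
    """Equation (3): average squared distance from f(x) to its nearest center."""
    total = 0.0
    for x in points:
        fx = f(x)
        total += min(sum((a - b) ** 2 for a, b in zip(fx, mu)) for mu in centers)
    return total / len(points)


# Two 2-clusterings of X = {0, ..., 5} that disagree only on element 5.
X = set(range(6))
C1 = [{0, 1, 2}, {3, 4, 5}]
C2 = [{3, 4}, {0, 1, 2, 5}]
diff_X = clustering_difference(C1, C2, X)

# Sample-based difference on S = {0, 3, 5}.
S = {0, 3, 5}
diff_S = clustering_difference(C1, C2, S)

# Cost of two centers under the identity mapping on four points of the line.
cost = kmeans_cost([0.0, 1.0, 4.0, 5.0], lambda x: (x,), [(0.5,), (4.5,)])
```

Under this reading, the single misplaced element contributes to two symmetric differences, so `diff_X` is 2/6 and `diff_S` is 2/3. The permutation search is factorial in k and is meant for small examples only.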













2.2 Formal Problem Statement 


Let $C^*$ be the target k-clustering of $X$. A (supervised) representation learner for clustering takes as input a sample $S \subset X$ and its clustering, $C^*|_S$, and outputs a mapping $f$ from a set of potential mappings $\mathcal{F}$. In the following, PAC stands for the notion of “probably approximately correct”.

Definition 1. PAC Supervised Representation Learner for K-Means (PAC-SRLK)

Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^d$. A representation learning algorithm $A$ is a PAC-SRLK with sample complexity $m_{\mathcal{F}} : (0,1)^2 \to \mathbb{N}$ with respect to $\mathcal{F}$, if for every $(\epsilon, \delta) \in (0,1)^2$, every domain set $X$ and every clustering of $X$, $C^*$, the following holds:

if $S$ is a randomly (uniformly) selected subset of $X$ of size at least $m_{\mathcal{F}}(\epsilon, \delta)$, then with probability at least $1 - \delta$

$$\Delta_X(C^*, C^{f_A}_X) \le \inf_{f \in \mathcal{F}} \Delta_X(C^*, C^f_X) + \epsilon \tag{8}$$

where $f_A = A(S, C^*|_S)$ is the output of the algorithm.

This can be regarded as a formal PAC framework to analyze the problem of learning representation for k-means clustering. The learner is compared to the best mapping in the class $\mathcal{F}$.

A natural question is how to bound the sample complexity of PAC-SRLK with respect to $\mathcal{F}$. Intuitively, for richer classes of mappings, we need larger clustered samples. Therefore, we need to introduce an appropriate notion of “capacity” for $\mathcal{F}$ and bound the sample complexity based on it. This is addressed in the next sections.

3 ANALYSIS AND RESULTS

3.1 Empirical Risk Minimization

In order to prove an upper bound for the sample complexity of representation learning for clustering, we need to consider an algorithm, and prove a sample complexity bound for it. Here, we show that any ERM-type algorithm can be used for this purpose. Therefore, we will be able to prove an upper bound for the sample complexity of PAC-SRLK.

Let $\mathcal{F}$ be a class of mappings and $X$ be the domain set. A TERM$^{4}$ learner for $\mathcal{F}$ takes as input a sample $S \subset X$ and its clustering $Y$ and outputs:

$$A^{\mathrm{TERM}}(S, Y) = \arg\min_{f \in \mathcal{F}} \Delta_S(C^f_X|_S, Y) \tag{9}$$

Note that we call it transductive, because it is implicitly assumed that it has access to the unlabeled dataset (i.e., $X$). A TERM algorithm goes over all mappings in $\mathcal{F}$ and selects the mapping that is most consistent with the given clustering: the mapping under which, if we perform k-means clustering of $X$, the sample-based $\Delta$-difference between the result and $Y$ is minimized.

Note that we are not studying this algorithm as a computational tool; we only use it to show an upper bound for the sample complexity.

Intuitively, this algorithm will work well when the empirical $\Delta$-difference and the true $\Delta$-difference of the mappings in the class are close to each other. In this case, by minimizing the empirical difference, the algorithm will automatically minimize the true difference as well. In order to formalize this idea, we define the notion of “representativeness” of a sample.

Definition 2. ($\epsilon$-Representative Sample) Let $\mathcal{F}$ be a class of mappings from $X$ to $\mathbb{R}^d$. A sample $S$ is $\epsilon$-representative with respect to $\mathcal{F}$, $X$ and the clustering $C^*$, if for every $f \in \mathcal{F}$ the following holds

$$|\Delta_X(C^*, C^f_X) - \Delta_S(C^*, C^f_X)| \le \epsilon \tag{10}$$

The following theorem shows that for the TERM algorithm to work, it is sufficient to supply it with a representative sample.

Theorem 1. (Sufficiency of Uniform Convergence) Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^d$. If $S$ is an $\frac{\epsilon}{2}$-representative sample with respect to $X$, $\mathcal{F}$ and $C^*$ then

$$\Delta_X(C^*, C^{\hat{f}}_X) \le \Delta_X(C^*, C^{f^*}_X) + \epsilon \tag{11}$$

where $f^* = \arg\min_{f \in \mathcal{F}} \Delta_X(C^*, C^f_X)$ and $\hat{f} = A^{\mathrm{TERM}}(S, C^*|_S)$.

Proof. Using $\frac{\epsilon}{2}$-representativeness of $S$ and the fact that $\hat{f}$ is the empirical minimizer of the loss function, we have

$$\Delta_X(C^*, C^{\hat{f}}_X) \le \Delta_S(C^*, C^{\hat{f}}_X) + \frac{\epsilon}{2} \tag{12}$$

$$\le \Delta_S(C^*, C^{f^*}_X) + \frac{\epsilon}{2} \tag{13}$$

$$\le \Delta_X(C^*, C^{f^*}_X) + \frac{\epsilon}{2} + \frac{\epsilon}{2} \tag{14}$$

$$\le \Delta_X(C^*, C^{f^*}_X) + \epsilon \tag{15}$$

□

4 TERM stands for Transductive Empirical Risk Minimizer

Therefore, we just need to provide an upper bound for the sample complexity of uniform convergence: “how many instances do we need to make sure that with high probability our sample is $\epsilon$-representative?”
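The TERM rule of Equation (9) can be sketched for a finite class of mappings as follows. This is an illustration under stated assumptions rather than a practical algorithm (the paper itself uses TERM only as an analysis device): k-means is solved by exhaustive search, which only works for tiny data sets, and the data, the two candidate mappings, and all function names are hypothetical.

```python
from itertools import permutations, product


def kmeans_partition(points, f, k):
    """Brute-force k-means in the mapped space: try every labeling that uses
    all k clusters and keep the cheapest one."""
    mapped = [f(x) for x in points]
    best_cost, best_labels = None, None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue
        cost = 0.0
        for c in range(k):
            members = [m for m, l in zip(mapped, labels) if l == c]
            center = [sum(col) / len(members) for col in zip(*members)]
            cost += sum(sum((a - b) ** 2 for a, b in zip(m, center)) for m in members)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return [{x for x, l in zip(points, best_labels) if l == c} for c in range(k)]


def sample_difference(c1, c2, sample):
    """Sample-based difference between a clustering of X and a clustering of S."""
    k = len(c1)
    return min(
        sum(len((c1[i] ^ c2[s[i]]) & sample) for i in range(k))
        for s in permutations(range(k))
    ) / len(sample)


def term(points, sample, sample_clustering, mappings, k):
    """Pick the mapping whose induced k-means clustering of the full data set
    agrees best with the expert's clustering of the sample."""
    scores = {}
    for name, f in mappings.items():
        induced = kmeans_partition(points, f, k)
        scores[name] = sample_difference(induced, sample_clustering, sample)
    return min(scores, key=scores.get), scores


# Hypothetical data: the first coordinate and the second coordinate suggest
# two different 2-clusterings of the same six points.
points = [(0.0, 0.0), (0.1, 5.0), (0.2, 0.1), (5.0, 0.2), (5.1, 5.1), (5.2, 4.9)]
mappings = {
    "first": lambda x: (x[0],),
    "second": lambda x: (x[1],),
}
# The expert clusters a sample of three points by the second coordinate.
sample = {points[0], points[1], points[3]}
expert_clustering = [{points[0], points[3]}, {points[1]}]

chosen, scores = term(points, sample, expert_clustering, mappings, 2)
```

Here `chosen` is `"second"`: its induced clustering agrees with the expert on every sampled point, while the clustering induced by `"first"` does not.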

3.2 Classes of Mappings with a Uniqueness Property 

In general, the solution to k-means clustering may not be unique. Therefore, the learner may end up finding a mapping that corresponds to multiple different clusterings. This is not desirable, because in this case, the output of the learner will not be interpretable. Therefore, it is reasonable to choose the class of potential mappings in a way that it includes only the mappings under which the solution is unique.

In order to make this idea concrete, we need to define an appropriate notion of uniqueness. We use a notion similar to the one introduced by Balcan et al. (2009) with a slight modification.$^{5}$

Definition 3. ($(\eta, \epsilon)$-Uniqueness) We say that k-means clustering for domain $X$ under mapping $f : X \to \mathbb{R}^d$ has an $(\eta, \epsilon)$-unique solution, if every $\eta$-optimal solution of the k-means cost is $\epsilon$-close to the optimal solution. Formally, the solution is $(\eta, \epsilon)$-unique if every partition $P$ that satisfies

$$\mathrm{COST}_X(f, P) < \mathrm{COST}_X(f, C^f_X) + \eta \tag{16}$$

would also satisfy

$$\Delta_X(C^f_X, P) < \epsilon \tag{17}$$

In the degenerate case where the optimal solution to k-means is not unique itself (and so $C^f_X$ is not well-defined), we say that the solution is not $(\eta, \epsilon)$-unique.

It can be noted that the definition of $(\eta, \epsilon)$-uniqueness not only requires the optimal solution to k-means clustering to be unique, but also all the “near-optimal” minimizers of the k-means clustering cost should be “similar”. This is a natural strengthening of the uniqueness condition, to guard against cases where there are $\eta_0$-optimizers of the cost function (for arbitrarily small $\eta_0$) with totally different solutions.

Now that we have a definition for uniqueness, we can define the set of mappings for $X$ under which the solution is unique. We say that a class of mappings $\mathcal{F}$ has the $(\eta, \epsilon)$-uniqueness property with respect to $X$, if every mapping in $\mathcal{F}$ has the $(\eta, \epsilon)$-uniqueness property over $X$.

Note that given an arbitrary class of mappings $\mathcal{F}$, we can find a subset of it that satisfies the $(\eta, \epsilon)$-uniqueness property over $X$. Also, as argued above, this subset is the useful subset to work with. Therefore, in the rest of the paper, we investigate learning for classes with the $(\eta, \epsilon)$-uniqueness property. In the next section, we prove uniform convergence results for such classes.

5 Our notion is additive in both parameters rather than multiplicative

3.3 Uniform Convergence Results

In Section 3.1, we defined the notion of $\epsilon$-representative samples. Also, we proved that if a TERM algorithm is fed with such a representative sample, it will work satisfactorily. The most technical part of the proof is then about the question “how large should the sample be in order to make sure that with high probability it is actually a representative sample?”

In order to formalize this notion, let $\mathcal{F}$ be a set of mappings from a domain $X$ to $(0,1)^n$.$^{6}$ Define the sample complexity of uniform convergence, $m^{UC}_{\mathcal{F}}(\epsilon, \delta)$, as the minimum number $m$ such that for every fixed partition $C^*$, if $S$ is a randomly (uniformly) selected subset of $X$ with size $m$, then with probability at least $1 - \delta$, for all $f \in \mathcal{F}$ we have

$$|\Delta_X(C^*, C^f_X) - \Delta_S(C^*, C^f_X)| \le \epsilon \tag{18}$$

The technical part of this paper is devoted to providing an upper bound for this sample complexity.

3.3.1 Preliminaries


Definition 4. ($\epsilon$-cover and covering number) Let $\mathcal{F}$ be a set of mappings from $X$ to $(0,1)^n$. A subset $\hat{\mathcal{F}} \subset \mathcal{F}$ is called an $\epsilon$-cover for $\mathcal{F}$ with respect to the metric $d(\cdot,\cdot)$ if for every $f \in \mathcal{F}$ there exists $\hat{f} \in \hat{\mathcal{F}}$ such that $d(f, \hat{f}) < \epsilon$. The covering number, $\mathcal{N}(\mathcal{F}, d, \epsilon)$, is the size of the smallest $\epsilon$-cover of $\mathcal{F}$ with respect to $d$.

In the above definition, we did not specify the metric $d$. In our analysis, we are interested in the $L_1$ distance with respect to $X$, namely:

$$d^X_{L_1}(f_1, f_2) = \frac{1}{|X|} \sum_{x \in X} \|f_1(x) - f_2(x)\|_2 \tag{19}$$

Note that the mappings we consider are not real-valued functions, but their output is an $n$-dimensional vector. This is in contrast to the usual analysis used for learning real-valued functions. If $f_1$ and $f_2$ are real-valued, then the $L_1$ distance is defined by

$$d^X_{L_1}(f_1, f_2) = \frac{1}{|X|} \sum_{x \in X} |f_1(x) - f_2(x)| \tag{20}$$

6 In the analysis, for simplicity, we will assume that the set of mappings is a function to the bounded space $(0,1)^n$ wherever needed
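As an illustration of Definition 4 and Equation (19), the sketch below computes the $L_1$ distance between two mappings over a finite domain and builds an $\epsilon$-cover greedily. The greedy procedure is a choice made for this example, not a construction taken from the paper: it returns some valid $\epsilon$-cover, so its size only upper-bounds the covering number. The family of scalings used as the class of mappings is hypothetical.

```python
import math


def l1_distance(f1, f2, domain):
    """Equation (19): average Euclidean distance between the two images."""
    return sum(math.dist(f1(x), f2(x)) for x in domain) / len(domain)


def greedy_cover(mappings, domain, eps):
    """Add a mapping to the cover only if no current cover element is within
    distance eps of it. Every mapping ends up within eps of some element."""
    cover = []
    for f in mappings:
        if all(l1_distance(f, g, domain) >= eps for g in cover):
            cover.append(f)
    return cover


# Hypothetical class: one-dimensional scalings x -> (a * x,) for a = 0..9.
domain = [0.0, 1.0, 2.0]
scalings = [lambda x, a=a: (a * x,) for a in range(10)]

cover = greedy_cover(scalings, domain, 2.5)
```

On this toy class the distance between the scalings with factors a and b is |a - b|, so the greedy pass keeps the factors 0, 3, 6 and 9 and `len(cover)` is 4.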











We will prove sample complexity bounds for our problem based on the $L_1$-covering number of the set of mappings. However, it will be beneficial to have a bound based on some notion of capacity, similar to VC-dimension, as well. This will help in better understanding and easier analysis of the sample complexity of different classes. While VC-dimension is defined for binary valued functions, we need a similar notion for functions with outputs in $\mathbb{R}^n$. For real-valued functions, we have such a notion, called pseudo-dimension (Pollard (1984)).

Definition 5. (Pseudo-Dimension) Let $\mathcal{F}$ be a set of functions from $X$ to $\mathbb{R}$. Let $S = \{x_1, x_2, \ldots, x_m\}$ be a subset of $X$. Then $S$ is pseudo-shattered by $\mathcal{F}$ if there are real numbers $r_1, r_2, \ldots, r_m$ such that for every $b \in \{0,1\}^m$, there is a function $f_b \in \mathcal{F}$ with $\mathrm{sgn}(f_b(x_i) - r_i) = b_i$ for $i \in [m]$. The pseudo-dimension of $\mathcal{F}$, called $\mathrm{Pdim}(\mathcal{F})$, is the size of the largest shattered set.

It can be shown (e.g., Theorem 18.4 in Anthony and Bartlett (2009)) that for a real-valued class $\mathcal{F}$, if $\mathrm{Pdim}(\mathcal{F}) \le q$ then $\log \mathcal{N}(\mathcal{F}, d^X_{L_1}, \epsilon) = \tilde{O}(q)$ where $\tilde{O}()$ hides logarithmic factors of $\frac{1}{\epsilon}$. In the next sections, we will generalize this notion to $\mathbb{R}^n$-valued functions.
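Definition 5 can be checked mechanically for a finite class and a fixed choice of witnesses. The sketch below does only that: it tests whether the given witnesses $r_i$ realize all $2^m$ sign patterns, and does not search over witnesses, so a negative answer is not a proof that the set cannot be shattered. The convention that sgn returns 1 for non-negative arguments and 0 otherwise is an assumption, as are the function names and the two-element class.

```python
from itertools import product


def sgn(a):
    # Assumed convention: 1 for non-negative values, 0 otherwise.
    return 1 if a >= 0 else 0


def is_pseudo_shattered(functions, points, witnesses):
    """True if every bit pattern b is realized by some f in `functions`
    through sgn(f(x_i) - r_i) = b_i for the given witnesses r_i."""
    realized = {
        tuple(sgn(f(x) - r) for x, r in zip(points, witnesses))
        for f in functions
    }
    return all(b in realized for b in product((0, 1), repeat=len(points)))


# Hypothetical class with two real-valued functions.
functions = [lambda x: x, lambda x: -x]

shattered_one = is_pseudo_shattered(functions, [1.0], [0.0])
shattered_two = is_pseudo_shattered(functions, [1.0, 2.0], [0.0, 0.0])
```

A class with two functions can realize at most two patterns, so the single point is pseudo-shattered with witness 0 while the pair of points is not.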


3.3.2 Reduction to Binary Hypothesis Classes 

Let $f_1, f_2 \in \mathcal{F}$ be two mappings and $\sigma$ be a permutation over $[k]$. Define the binary-valued function $h^{f_1,f_2}_\sigma(\cdot)$ as follows

$$h^{f_1,f_2}_\sigma(x) = \begin{cases} 1 & x \in \bigcup_{i=1}^{k} (C^{f_1}_i \,\Delta\, C^{f_2}_{\sigma(i)}) \\ 0 & \text{otherwise} \end{cases} \tag{21}$$

Let $H^{\mathcal{F}}_\sigma$ be the set of all such functions with respect to $\mathcal{F}$ and $\sigma$.

$$H^{\mathcal{F}}_\sigma = \{h^{f_1,f_2}_\sigma(\cdot) : f_1, f_2 \in \mathcal{F}\} \tag{22}$$

Finally, let $H^{\mathcal{F}}$ be the union of all $H^{\mathcal{F}}_\sigma$ over all choices of $\sigma$. Formally, if $\pi$ is the set of all permutations over $[k]$, then

$$H^{\mathcal{F}} = \bigcup_{\sigma \in \pi} H^{\mathcal{F}}_\sigma \tag{23}$$

For a set $S$, and a binary function $h(\cdot)$, let $h(S) = \frac{1}{|S|} \sum_{x \in S} h(x)$. We now show that a uniform convergence result with respect to $H^{\mathcal{F}}$ is sufficient to have uniform convergence for the $\Delta$-difference function. Therefore, we will be able to investigate conditions for uniform convergence of $H^{\mathcal{F}}$ rather than the $\Delta$-difference function.

Theorem 2. Let $X$ be a domain set, $\mathcal{F}$ be a set of mappings, and $H^{\mathcal{F}}$ be defined as above. If $S \subset X$ is such that

$$\forall h \in H^{\mathcal{F}}, \; |h(S) - h(X)| \le \epsilon \tag{24}$$

then $S$ will be $\epsilon$-representative with respect to $\mathcal{F}$, i.e., for all $f_1, f_2 \in \mathcal{F}$ we will have

$$|\Delta_X(C^{f_1}_X, C^{f_2}_X) - \Delta_S(C^{f_1}_X, C^{f_2}_X)| \le \epsilon \tag{25}$$

Proof.

$$|\Delta_S(C^{f_1}_X, C^{f_2}_X) - \Delta_X(C^{f_1}_X, C^{f_2}_X)| \tag{26}$$

$$\le \max_{\sigma \in \pi} |h^{f_1,f_2}_\sigma(S) - h^{f_1,f_2}_\sigma(X)| \le \epsilon \tag{29}$$

□

The fact that $H^{\mathcal{F}}$ is a class of binary-valued functions enables us to provide sample complexity bounds based on the VC-dimension of this class. However, providing bounds based on VC-Dim($H^{\mathcal{F}}$) is not sufficient, in the sense that it is not convenient to work with the class $H^{\mathcal{F}}$. Instead, it would be preferable to prove bounds directly based on the capacity of the class of mappings, $\mathcal{F}$. In the next section, we address this issue.

3.3.3 $L_1$-Covering Number and Uniform Convergence

The classes introduced in the previous section, $H^{\mathcal{F}}$ and $H^{\mathcal{F}}_\sigma$, are binary hypothesis classes. Also, we have shown that proving a uniform convergence result for $H^{\mathcal{F}}$ is sufficient for our purpose. In this section, we show that a bound on the $L_1$ covering number of $\mathcal{F}$ is sufficient to prove uniform convergence for $H^{\mathcal{F}}$.

In Section 3.2, we argued that we only care about the classes that have the $(\eta, \epsilon)$-uniqueness property. In the rest of this section, assume that $\mathcal{F}$ is a class of mappings from $X$ to $(0,1)^n$ that satisfies the $(\eta, \epsilon)$-uniqueness property.

Lemma 1. Let $f_1, f_2 \in \mathcal{F}$. If $d^X_{L_1}(f_1, f_2) < \frac{\eta}{12}$ then $\Delta_X(f_1, f_2) < 2\epsilon$.

We leave the proof of this lemma for the appendix, and present the next lemma.

















Lemma 2. Let $H^{\mathcal{F}}$ be defined as in the previous section. Then,

$$\mathcal{N}(H^{\mathcal{F}}, d^X_{L_1}, 2\epsilon) \le k! \, \mathcal{N}(\mathcal{F}, d^X_{L_1}, \tfrac{\eta}{12}) \tag{30}$$

Proof. Let $\hat{\mathcal{F}}$ be the $\frac{\eta}{12}$-cover corresponding to the covering number $\mathcal{N}(\mathcal{F}, d^X_{L_1}, \frac{\eta}{12})$. Based on the previous lemma, $H^{\hat{\mathcal{F}}}_\sigma$ is a $2\epsilon$-cover for $H^{\mathcal{F}}_\sigma$. But we have only $k!$ permutations of $[k]$; therefore, the covering number for $H^{\mathcal{F}}$ is at most $k!$ times larger than that of $H^{\mathcal{F}}_\sigma$. This proves the result. □

Basically, this means that if we have a small $L_1$ covering number for the mappings, we will have the uniform convergence result we were looking for. The following theorem proves this result.

Theorem 3. Let $\mathcal{F}$ be a set of mappings with the $(\eta, \epsilon)$-uniqueness property. Then for some constant $\alpha$ we have

$$m^{UC}_{\mathcal{F}}(\alpha\epsilon, \delta) \le O\left(\frac{\log k! + \log \mathcal{N}(\mathcal{F}, d^X_{L_1}, \frac{\eta}{12}) + \log(\frac{1}{\delta})}{\epsilon^2}\right) \tag{31}$$

Proof. Following the previous lemma, if we have a small $L_1$-covering number for $\mathcal{F}$, we will also have a small covering number for $H^{\mathcal{F}}$. But based on standard uniform convergence theory, if a hypothesis class has a small covering number, then it has the uniform convergence property. More precisely (e.g., Theorem 17.1 in Anthony and Bartlett (2009)), we have:

$$m^{UC}_{H^{\mathcal{F}}}(\epsilon_0, \delta) \le O\left(\frac{\log \mathcal{N}(H^{\mathcal{F}}, d^X_{L_1}, \frac{\epsilon_0}{16}) + \log(\frac{1}{\delta})}{\epsilon_0^2}\right) \tag{32}$$

Applying Lemma 2 to the above proves the result. □

3.3.4 Bounding $L_1$-Covering Number

In the previous section, we proved that if the $L_1$ covering number of the class of mappings is bounded, then we will have uniform convergence. However, it is desirable to have a bound with respect to a combinatorial dimension of the class (rather than the covering number). Therefore, we will generalize the notion of pseudo-dimension for the class of mappings that take values in $\mathbb{R}^n$.

Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^n$. For every mapping $f \in \mathcal{F}$, define real-valued functions $f_1, \ldots, f_n$ such that $f(x) = (f_1(x), \ldots, f_n(x))$. Now let $\mathcal{F}_i = \{f_i : f \in \mathcal{F}\}$. This means that $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_n$ are classes of real-valued functions. Now we define the pseudo-dimension of $\mathcal{F}$ as follows.

$$\mathrm{Pdim}(\mathcal{F}) = n \max_{i \in [n]} \mathrm{Pdim}(\mathcal{F}_i) \tag{33}$$

Proposition 2. Let $\mathcal{F}$ be a set of mappings from $X$ to $\mathbb{R}^n$. If $\mathrm{Pdim}(\mathcal{F}) \le q$ then $\log \mathcal{N}(\mathcal{F}, d^X_{L_1}, \epsilon) = \tilde{O}(q)$ where $\tilde{O}()$ hides logarithmic factors.

Proof. The result follows from the corresponding result for bounding the covering number of real-valued functions based on pseudo-dimension mentioned in the preliminaries section. The reason is that we can create a cover by composition of the $\frac{\epsilon}{n}$-covers of all $\mathcal{F}_i$. However, this will at most introduce a factor of $n$ in the logarithm of the covering number. □

Therefore, we can rewrite the result of the previous section in terms of pseudo-dimension.

Theorem 4. Let $\mathcal{F}$ be a class of mappings with the $(\eta, \epsilon)$-uniqueness property. Then

$$m^{UC}_{\mathcal{F}}(\epsilon, \delta) \le \tilde{O}\left(\frac{k + \mathrm{Pdim}(\mathcal{F}) + \log(\frac{1}{\delta})}{\epsilon^2}\right) \tag{34}$$

where $\tilde{O}()$ hides logarithmic factors of $k$ and $\frac{1}{\eta}$.

3.4 Sample Complexity of PAC-SRLK

In Section 3.1, we showed that uniform convergence is sufficient for a TERM algorithm to work. Also, in the previous section, we proved a bound for the sample complexity of uniform convergence. The following theorem, which is the main technical result of this paper, combines these two and provides a sample complexity upper bound for the PAC-SRLK framework.

Theorem 5. Let $\mathcal{F}$ be a class of $(\eta, \epsilon)$-unique mappings. Then the sample complexity of learning representation for k-means clustering with respect to $\mathcal{F}$ is upper bounded by

$$m_{\mathcal{F}}(\epsilon, \delta) \le \tilde{O}\left(\frac{k + \mathrm{Pdim}(\mathcal{F}) + \log(\frac{1}{\delta})}{\epsilon^2}\right) \tag{35}$$

where $\tilde{O}$ hides logarithmic factors of $k$ and $\frac{1}{\eta}$.

The proof is done by combining Theorems 1 and 4.

The following result shows an upper bound for the sample complexity of learning linear mappings (or equivalently, Mahalanobis metrics).

Corollary 1. Let $\mathcal{F}$ be a set of $(\eta, \epsilon)$-unique linear mappings from $\mathbb{R}^{d_1}$ to $\mathbb{R}^{d_2}$. Then we have

$$m_{\mathcal{F}}(\epsilon, \delta) \le \tilde{O}\left(\frac{k + d_1 d_2 + \log(\frac{1}{\delta})}{\epsilon^2}\right) \tag{36}$$

Proof. It is a standard result that the pseudo-dimension of a vector space of real-valued functions is just the dimensionality of the space (in our case $d_1$) (e.g., Theorem 11.4 in Anthony and Bartlett (2009)). Also, based on our definition of Pdim for $\mathbb{R}^{d_2}$-valued functions, it should scale by a factor of $d_2$. □
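To get a feel for how the bounds of Theorem 5 and Corollary 1 scale, the following sketch evaluates the expression inside the $\tilde{O}$, assuming it has the form $(k + \mathrm{Pdim}(\mathcal{F}) + \log(1/\delta))/\epsilon^2$. The constant and the logarithmic factors hidden by $\tilde{O}$ are ignored, so the returned numbers are not sample sizes; only ratios between them are meaningful. Function names and parameter values are hypothetical.

```python
import math


def bound_scale(k, pdim, eps, delta):
    """(k + Pdim + log(1/delta)) / eps^2, without the hidden constant and
    logarithmic factors."""
    return (k + pdim + math.log(1.0 / delta)) / eps ** 2


def linear_bound_scale(k, d1, d2, eps, delta):
    """Same expression for linear maps from R^d1 to R^d2, with Pdim = d1 * d2."""
    return bound_scale(k, d1 * d2, eps, delta)


# Halving the accuracy parameter multiplies the expression by four.
coarse = linear_bound_scale(k=5, d1=10, d2=3, eps=0.1, delta=0.1)
fine = linear_bound_scale(k=5, d1=10, d2=3, eps=0.05, delta=0.1)
ratio = fine / coarse
```

The dependence on the dimensions is through the product $d_1 d_2$, so doubling either dimension roughly doubles the expression once that product dominates $k$ and $\log(1/\delta)$.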


4 CONCLUSIONS AND OPEN PROBLEMS

In this paper we provided a formal statistical framework for learning the representation (i.e., a mapping) for k-means clustering based on supervised feedback. The learner, unaware of the target clustering of the domain, is given a clustering of a sample set. The learner’s task is then finding a mapping function $f$ (among a class of mappings) under which the result of k-means clustering of the domain is as close as possible to the true clustering. This framework was called PAC-SRLK.

A notion of $\epsilon$-representativeness was introduced, and it was proved that any ERM-type algorithm that has access to such a sample will work satisfactorily. Finally, a technical uniform convergence result was proved to make sure that a large enough sample is (with high probability) $\epsilon$-representative. This was used to prove an upper bound for the sample complexity of PAC-SRLK based on covering numbers of the set of mappings. Furthermore, a notion of pseudo-dimension for the class of mappings was defined, and the sample complexity was upper bounded based on it.

Note that in the analysis, the notion of $(\eta, \epsilon)$-uniqueness (similar to that of Balcan et al. (2009)) was used and it was argued that it is reasonable to require the learner to output a mapping under which the solution is “unique” (because otherwise the output of k-means clustering would not be interpretable). Therefore, in the analysis, we assumed that the class of potential mappings has the $(\eta, \epsilon)$-uniqueness property.

It can be noted that we did not analyze the computational complexity of algorithms for the PAC-SRLK framework. We leave this analysis to future work. We just note that a similar notion of uniqueness proposed by Balcan et al. (2009) resulted in being able to solve k-means clustering efficiently.

One other observation is that representation learning can be regarded as a special case of metric learning, because for every mapping, we can define a distance function that computes the distance in the mapped space. In this light, we can make the problem more general by making the learner find a distance function rather than a mapping. This is more challenging to analyze, because we do not even know a generalization bound for center-based clustering under general distance functions. An open question is to provide such general results.

Acknowledgments

5 APPENDIX

Proof of Lemma 1. Let $\mathcal{F}$ be a set of mappings from $X$ to $(0,1)^n$ that have the $(\eta, \epsilon)$-uniqueness property. Let $f_1, f_2 \in \mathcal{F}$ and $d^X_{L_1}(f_1, f_2) < \frac{\eta}{12}$. We need to prove that $\Delta_X(f_1, f_2) < 2\epsilon$. In order to prove this, note that due to the triangle inequality, we have

$$\Delta_X(f_1, f_2) = \Delta_X(C_{f_1}(\mu^{f_1}), C_{f_2}(\mu^{f_2})) \le \Delta_X(C_{f_1}(\mu^{f_1}), C_{f_1}(\mu^{f_2})) + \Delta_X(C_{f_1}(\mu^{f_2}), C_{f_2}(\mu^{f_2})) \tag{37}$$

Therefore, it will be sufficient to show that each of the $\Delta$-terms above is smaller than $\epsilon$. We start by proving a useful lemma.


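Lemma 3. Let $f_1, f_2 \in \mathcal{F}$ and $d^X_{L_1}(f_1, f_2) < \frac{\eta}{12}$. Let $\mu$ be an arbitrary set of $k$ centers in $(0,1)^n$. Then

$$|\mathrm{COST}_X(f_1, \mu) - \mathrm{COST}_X(f_2, \mu)| < \frac{\eta}{2}$$

Proof.

$$|\mathrm{COST}_X(f_1, \mu) - \mathrm{COST}_X(f_2, \mu)| \tag{38}$$

$$= \left| \frac{1}{|X|} \sum_{x \in X} \left( \min_{\mu_j \in \mu} \|f_1(x) - \mu_j\|^2 - \min_{\mu_j \in \mu} \|f_2(x) - \mu_j\|^2 \right) \right| \tag{39}$$

$$\le \frac{1}{|X|} \sum_{x \in X} \max_{\mu_j \in \mu} \left| \|f_1(x)\|^2 - \|f_2(x)\|^2 - 2\langle \mu_j, f_1(x) - f_2(x) \rangle \right| \tag{40}$$

$$= \frac{1}{|X|} \sum_{x \in X} \max_{\mu_j \in \mu} \left| \langle f_1(x) - f_2(x), f_1(x) + f_2(x) - 2\mu_j \rangle \right| \tag{41}$$

$$\le \frac{6}{|X|} \sum_{x \in X} \|f_1(x) - f_2(x)\|_2 < \frac{6\eta}{12} = \frac{\eta}{2} \tag{42}$$

□

Now we are ready to prove that the first $\Delta$-term is smaller than $\epsilon$, i.e., $\Delta_X(C_{f_1}(\mu^{f_1}), C_{f_1}(\mu^{f_2})) < \epsilon$. But to do so, we only need to show that $\mathrm{COST}_X(f_1, \mu^{f_2}) - \mathrm{COST}_X(f_1, \mu^{f_1}) < \eta$; because in that case, due to the $(\eta, \epsilon)$-uniqueness property of $f_1$, the result will follow. Now, using Lemma 3, we have

$$\mathrm{COST}_X(f_1, \mu^{f_2}) - \mathrm{COST}_X(f_1, \mu^{f_1}) \tag{43}$$

$$\le \left( \mathrm{COST}_X(f_2, \mu^{f_2}) + \frac{\eta}{2} \right) - \mathrm{COST}_X(f_1, \mu^{f_1}) \tag{44}$$

$$= \min_{\mu} \mathrm{COST}_X(f_2, \mu) - \min_{\mu} \mathrm{COST}_X(f_1, \mu) + \frac{\eta}{2} \tag{45}$$

$$\le \max_{\mu} \left( \mathrm{COST}_X(f_2, \mu) - \mathrm{COST}_X(f_1, \mu) \right) + \frac{\eta}{2} \tag{46}$$

$$\le \frac{\eta}{2} + \frac{\eta}{2} \le \eta \tag{47}$$

where in the first and the last line we used Lemma 3.

Finally, we need to prove the second $\Delta$-inequality, i.e., $\Delta_X(C_{f_1}(\mu^{f_2}), C_{f_2}(\mu^{f_2})) < \epsilon$. Assume the contrary. Then, based on the $(\eta, \epsilon)$-uniqueness property of $f_2$, we conclude that $\mathrm{COST}_X(f_2, C_{f_1}(\mu^{f_2})) - \mathrm{COST}_X(f_2, C_{f_2}(\mu^{f_2})) \ge \eta$. In the following, we prove that this cannot be true, and hence a contradiction.

Let $m_x = \arg\min_{\mu_0 \in \mu^{f_2}} \|f_1(x) - \mu_0\|^2$. Then, based on the boundedness of $f_1(x)$, $f_2(x)$ and $m_x$, we have:

$$\mathrm{COST}_X(f_2, C_{f_1}(\mu^{f_2})) - \mathrm{COST}_X(f_2, C_{f_2}(\mu^{f_2})) \tag{48}$$

$$\le \left( \frac{1}{|X|} \sum_{x \in X} \|f_2(x) - m_x\|^2 \right) - \mathrm{COST}_X(f_2, \mu^{f_2}) \tag{49}$$

$$= \left( \frac{1}{|X|} \sum_{x \in X} \|f_2(x) - f_1(x) + f_1(x) - m_x\|^2 \right) - \mathrm{COST}_X(f_2, \mu^{f_2}) \tag{50}$$

$$= \frac{1}{|X|} \sum_{x \in X} \|f_2(x) - f_1(x)\|^2 + \frac{1}{|X|} \sum_{x \in X} \|f_1(x) - m_x\|^2 + \frac{1}{|X|} \sum_{x \in X} 2\langle f_2(x) - f_1(x), f_1(x) - m_x \rangle - \mathrm{COST}_X(f_2, \mu^{f_2}) \tag{51}$$

$$\le \frac{2}{|X|} \sum_{x \in X} \|f_2(x) - f_1(x)\|_2 + \mathrm{COST}_X(f_1, \mu^{f_2}) + \frac{4}{|X|} \sum_{x \in X} \|f_2(x) - f_1(x)\|_2 - \mathrm{COST}_X(f_2, \mu^{f_2}) \tag{52}$$

$$= \frac{6}{|X|} \sum_{x \in X} \|f_2(x) - f_1(x)\|_2 + \left( \mathrm{COST}_X(f_1, \mu^{f_2}) - \mathrm{COST}_X(f_2, \mu^{f_2}) \right) \tag{53}$$

$$< \frac{6\eta}{12} + \frac{\eta}{2} = \eta \tag{54}$$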